ABSTRACT

This  paper  deals  with  model  validation  of  dynamic  systems  (with  vehicle  systems  being  of 
particular interest) that have multiple time-dependent outputs. First, we review several validation
methodologies  that  have  been  reported  in  the  literature:  graphical  comparison,  feature-based 
techniques,  PDF/CDF  based  techniques,  Bayesian  posterior  estimation,  classical  hypothesis 
testing  and  Bayesian  hypothesis  testing.  We  discuss  their  advantages  and  disadvantages  in  terms 
of  several  attributes:  applicability  to  different  types  of  models,  need  for  assumptions, 
computational  cost,  subjectivity,  propensity  to  type-I  or  II  errors,  and  others.  We  then  proceed 
with  the  most  important  attribute:  can  the  validation  method  provide  a  quantitative  measure  of  the 
goodness  of  the  model?  We  conclude  that  Bayesian-based  model  validation  frameworks  answer 
this  question  positively.  A  bootstrap  method  is  presented  that  obviates  the  need  to  assume  a 
statistical  distribution  model.  The  features  of  the  Bayesian  validation  framework  are  illustrated 
using  a  thermal  benchmark  problem  developed  by  Sandia  National  Laboratories  and  a  battery 
model  developed  in  the  Automotive  Research  Center,  a  US  Army  Center  of  Excellence  for 
modeling  and  simulation  of  ground  vehicle  systems. 


1.  INTRODUCTION 

Modeling  and  simulation  are  indispensable  tools  in 
engineering  design  and  development,  in  general,  and  vehicle 
systems,  in  particular.  However,  the  efficacy  of  this 
computer-aided  engineering  paradigm  depends  largely  on 
the  validity  of  the  utilized  models.  Verification,  validation 
and  accreditation  (VV&A)  deal  with  various  aspects  of  this 
challenging  issue.  In  brief,  verification  asks  the  question  of 
whether  the  mathematical  model  is  being  solved  correctly; 
validation  concerns  the  question  of  whether  a  model 
(assuming  that  it  is  being  solved  correctly)  is  an  adequate 
representation  of  the  “real”  physical  system  at  hand; 
accreditation  provides  certification  for  a  model  to  be 
exercised  within  a  well-defined  scope. 

In  this  paper,  we  consider  the  challenge  of  model 
validation.  Typically,  model  validation  entails  the 
comparison  of  numerical  predictions  (CAE  data)  to 
experimental  data  (test  data).  Clearly,  validation  is  a  highly 
contextual process; e.g., a low-fidelity model may be
adequate for a specific application, while even a high-fidelity
model may fail to capture nuances of natural
phenomena.  In  addition,  the  decision  of  whether  a  model  is 
“good  enough”  is  almost  always  subjective  as  it  is  based  on 
human  perceptions  and  knowledge  that  may  be  incomplete. 
Moreover,  the  nature  of  the  system  being  modeled  and  the 
type  of  model  output  considered  can  vary  significantly.  In 
this  regard,  there  does  not  seem  to  be  a  “silver  bullet” 
approach  to  model  validation. 

This  paper  deals  with  model  validation  of  dynamic 
systems  (with  vehicle  systems  being  of  particular  interest) 
that have multiple time-dependent outputs. The remainder of
this  section  provides  a  listing  of  attributes  that  are  desirable 
for  validation  methodologies,  followed  by  our  classification 
of  existing  validation  methodologies,  along  with  their  brief 
descriptions. 

1.1 Attributes of Validation Techniques

Validation  techniques  may  be  applied  across  a  wide  range 
of  engineering  systems.  We  identify  the  following  attributes 
that  should  be  considered  when  assessing  the  utility  of  any 
validation  technique: 

Applicable to scalar data: the suitability of a validation
technique  to  be  applied  for  comparing  scalars.  A  scalar  is  a 
single  numerical  quantity  observed/calculated  in  one  or 
multiple  repeated  experiments/computations. 

Applicable  to  vector  data:  the  suitability  of  a  validation 
technique  to  be  applied  for  comparing  vectors.  A  vector  is  a 
finite  collection  of  scalars. 

Applicable  to  scalar  time  series:  the  suitability  of  a 
validation  technique  to  be  applied  for  comparing  scalar  time 
series,  comprising  a  sequence  of  scalars  recorded  at 
successive  time  points.  Unlike  scalar  and  vector  data,  time 
series  data  often  have  serial  dependence,  in  which  there  is 
statistical dependence between a value observed at time
point $t_i$ and the value observed at another time point $t_j$.

Applicable  to  vector  time  series:  the  suitability  of  a 
validation  technique  to  be  applied  for  comparing  vector  time 
series  which  are  a  sequence  of  vectors  recorded  at  successive 
time  points.  Vector  time  series  can  be  considered  as  a 
collection  of  multiple  scalar  time  series;  consequently,  they 
too  often  have  serial  dependence. 

Consider  multivariate  correlation:  the  ability  of  a 
validation  technique  to  use  the  correlation  information  of 
multivariate  data.  Although  a  validation  technique  suitable 
only  for  univariate  data  could  be  applied  to  each  response  of 
the  multivariate  data,  the  validation  results  for  each  response 
might  be  in  conflict. 

Include objective criteria: whether a validation technique
has objective criteria to accept/reject a model.
An  objective  criterion  is  developed  based  on  mathematical 
or  statistical  reasoning. 

Quantify  model  confidence:  the  ability  of  a  validation 
technique  to  provide  a  quantitative  assessment  of  the  validity 
of  the  model  in  terms  of  model  confidence.  For  example,  in 
hypothesis testing, the null hypothesis is set up to state
that the computer model is accurate. Model
confidence  is  the  probability  of  this  null  hypothesis  being 
true. 

Incorporate  SME  opinions:  the  ability  of  a  validation 
technique  to  utilize  information  provided  by  Subject  Matter 
Experts  (SME)  in  the  process  of  validating  a  computer 
model. 

Normality assumption independence: the independence of
a validation technique from any normality assumption
for the distribution of either test data or CAE data. More
generally,  it  is  desirable  that  a  validation  technique  does  not 
require  any  particular  distribution  model. 

Insensitivity to type-I error: the insensitivity of validation
results to the type-I error level specified for classical
hypothesis testing validation techniques. Type-I error level,
or the rate of type-I error, is the probability of rejecting the
null hypothesis when it is true. It is known that specifying
the type-I error level at different values can lead to different
validation results (i.e., switching from accepting to rejecting the model) [1].

Low  computation  cost:  the  time  needed  to  execute  the 
validation  technique. 

Sample  size  independence:  the  insensitivity  of  the 
validation  results  to  the  selection  of  sample  size.  Sample  size 
is  the  number  of  observations  in  a  sample  which  is  a  subset 
of  the  population.  Validation  results  should  be  similar  if  data 
of  different  sample  sizes  are  used. 

1.2  Categorization  of  Validation  Techniques 


Figure  1:  Categorization  of  validation  techniques 

Graphical  comparison:  validation  techniques  that  generate 
validation  results  from  the  plot  of  test  data  and  CAE  data. 
An  intuitive  approach  is  to  plot  experimental  measurements 
and simulation outputs on the same graph. One decides
whether or not to accept the model by inspecting the
difference between the two data sets. No quantitative measure of
the  difference  between  the  two  quantities  compared  is 
involved.  In  [2]  the  authors  superimposed  the  computer- 
simulated  deformation  curve  onto  the  experimental  curve 
image  taken  by  a  high  speed  camera,  qualitatively  compared 
the  shape  of  the  curves  and  stated  that  the  two  curves  have 
good  correspondence.  In  [3]  the  authors  plotted  the  test  data 
as x-coordinates, and CAE data as y-coordinates. If the two
data sets agree with each other, the collection of all the data
points plotted should form a line of unit slope (base line).
Error  bounds  are  formed  by  drawing  two  lines  parallel  to  the 
base  line.  If  two  computer  models  are  compared  using  this 
plot,  one  model  would  be  preferred  if  considerably  fewer 
points  are  outside  the  error  bounds.  Similar  examples  of 
graphical techniques can be found in [4, 5]. Graphical
comparison may be subject to reader misinterpretation
because  of  unknown  underlying  data  structure  [6],  and  can 
be  biased  and  subjective  [7]. 

The application of graphical comparison is limited to
scalar data and univariate time series as it cannot handle the
correlation structure, although it can be applied to each
response of multivariate data. Graphical comparison lacks
objective rejection criteria as it is often based on subjective
judgment (past experience, SME opinions, etc.).
Additionally, it does not quantify model confidence. SME opinions
can be coupled with graphical comparison. For example, the
acceptance region on the graph can be set up based on inputs
from SMEs. There are no issues associated with type-I
error or sample size since graphical comparison is not
based on hypothesis testing. Computational cost is minimal.
Graphical comparison is best used as a supplementary tool
together with other validation techniques.

Feature-based techniques: validation techniques that draw
validation conclusions based on the difference between
features,  e.g.,  magnitude,  shape  and  phase  of  a  scalar  time 
series.  Several  magnitude-only  error  metrics  such  as  mean 
absolute  error  (MAE)  and  the  root  mean  square  error 
(RMSE)  are  discussed  in  [6]. 
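
As a minimal illustration of the magnitude-only metrics mentioned above, the following Python sketch computes MAE and RMSE for two equally sampled scalar time series. The function names and the sample values are hypothetical and are not taken from [6].

```python
import math


def mean_absolute_error(test, cae):
    """Mean absolute error between two equally sampled series."""
    if len(test) != len(cae) or not test:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(t - c) for t, c in zip(test, cae)) / len(test)


def root_mean_square_error(test, cae):
    """Root mean square error between two equally sampled series."""
    if len(test) != len(cae) or not test:
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((t - c) ** 2 for t, c in zip(test, cae)) / len(test))


# Hypothetical test and CAE series, for illustration only.
test_series = [0.0, 1.0, 2.0, 3.0]
cae_series = [0.0, 1.5, 2.0, 2.0]

mae = mean_absolute_error(test_series, cae_series)
rmse = root_mean_square_error(test_series, cae_series)
```

Both numbers summarize magnitude error only; neither says anything about phase or shape, which is why the metrics discussed next treat those features separately.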

The Sprague and Geers metric (SG) [8], the Knowles and
Gear metric (KG) [9] and Russell's metric (R) [10] are
similar  metrics  that  address  the  assessment  of  magnitude  and 
phase  error  simultaneously.  The  EARTH  metric  [11] 
evaluates  three  features:  phase,  magnitude  and  topology, 
where  topology  is  the  shape  or  slope  of  a  scalar  time  series. 
Discrepancy  in  phase  (both  global  and  local  timing  error)  is 
removed  by  shifting  the  time  history  before  analyzing  the 
magnitude  and  topology  errors.  Local  timing  error  is  taken 
care  of  by  the  use  of  dynamic  time  warping  (DTW).  Unlike 
KG,  SG  and  R  metrics,  there  is  no  comprehensive  form  of 
the  EARTH  metric  (i.e.  a  single  number  that  summarizes  all 
the  validation  results  for  different  features).  When 
evaluations  from  subject  matter  experts  (SME)  are  available, 
a  regression  is  performed  to  generate  comparable  ratings. 
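
Dynamic time warping can be sketched with the textbook dynamic-programming recurrence. The Python below is a generic DTW distance with an absolute-difference local cost; it is an assumption made for illustration and does not reproduce the specific warping constraints or cost function used by the EARTH metric [11].

```python
def dtw_distance(a, b):
    """Classic DTW distance between two sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats a sample
                cost[i][j - 1],      # b advances, a repeats a sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical signals: the second is the first delayed by one sample.
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
warped_cost = dtw_distance(reference, delayed)
```

Because the warping absorbs the one-sample delay, the accumulated cost for this pair is zero, which illustrates how a local timing error can be removed before magnitude and topology are compared.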

Feature-based  techniques  do  not  require  a  distribution 
assumption.  Their  application  is  limited  to  scalar  and  scalar 
time  series  as  they  cannot  handle  the  correlation  structure  of 
a  vector  or  vector  time  series.  This  limitation  can  be 
removed  by  the  use  of  dimensionality  and  correlation 
reduction  techniques.  Feature-based  techniques  lack 
objective  rejection  criteria.  Model  confidence  is  not 
quantified.  SME  opinions  can  be  incorporated  (see  [11]  for 
an  example  of  building  regression-based  validation  models 
using SME opinions). There are no issues associated with
type-I error or sample size since feature-based
techniques are not based on hypothesis testing.
Computational cost is low.

PDF/CDF-based techniques: validation techniques that
draw validation conclusions based on the distance between
the probability density functions (PDFs) or cumulative
distribution functions (CDFs) of test data and CAE data.
Non-deterministic test data and CAE data are considered as
random variables. In [12] the
authors examined whether or not the deterministic scalar test
data  are  within  the  highest  density  region  (HDR)  of  the  PDF 
of the CAE data. In [13] the authors developed a metric based on the maximum
horizontal distance between the two CDFs. The selection of
a rejection criterion is subjective. Similarly, the
Kolmogorov-Smirnov statistic measures the vertical distance
between the two CDFs. If, however, the data have very
small variability (almost deterministic), the vertical distance
could be very large even though the two CDFs are very
similar when their distance is measured horizontally.

Another  measure  of  the  distance  between  CDF’s  was 
developed  in  [14],  where  the  area  between  the  two  CDF’s 
was  suggested  as  a  validation  metric.  It  was  argued  that  the 
area  metric  enjoys  several  advantages  such  as  ease  of 
interpretation,  objectiveness  and  ability  to  express  validation 
results  in  terms  of  physical  units.  The  CDF  of  the  CAE  data 
is  assumed  to  be  known.  The  authors  suggested  that  this 
CDF  be  obtained  by  solving  the  mathematical  model 
analytically  or  by  propagating  a  large  number  of  replicate 
samples  via  Monte-Carlo  simulation.  The  test  data,  on  the 
other  hand,  is  usually  provided  as  a  collection  of  point 
values  in  a  data  set.  The  empirical  cumulative  distribution 
function  (ECDF)  was  used  to  describe  the  distribution  of  the 
test data. The authors illustrated that this area metric is better
than those based solely on the mean and/or variance of the
data, as it was able to detect a difference when the mean
and variance of the observations match but the distributions
do not. For scalar time series data, the u-pooling
method was developed to pool all the observations together
and use statistical tests (e.g., the Kolmogorov-Smirnov test) to
evaluate the accuracy of the model, since the pooled points
should form a uniform distribution if the test data match the CAE
data. A threshold value was not provided since the authors
consider it the task of decision makers. In the u-pooling
method the CAE data distribution is assumed to be known,
but in practice this is often not the case.
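
To make the difference between the vertical (Kolmogorov-Smirnov) distance and the area metric concrete, the sketch below evaluates both on two empirical CDFs. Using an ECDF for the CAE data (for example, built from Monte Carlo samples) is an assumption of this sketch, as are the function names and sample values; the formulation in [14] may differ in detail.

```python
from bisect import bisect_right


def ecdf(sorted_samples, x):
    """Empirical CDF of the sorted samples evaluated at x."""
    return bisect_right(sorted_samples, x) / len(sorted_samples)


def area_metric(test, cae):
    """Area between two ECDFs, expressed in the units of the data."""
    a, b = sorted(test), sorted(cae)
    grid = sorted(set(a) | set(b))
    area = 0.0
    # Both ECDFs are constant on [left, right), so the integral is a sum.
    for left, right in zip(grid, grid[1:]):
        area += abs(ecdf(a, left) - ecdf(b, left)) * (right - left)
    return area


def ks_distance(test, cae):
    """Maximum vertical distance between two ECDFs."""
    a, b = sorted(test), sorted(cae)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Hypothetical samples: the CAE data equal the test data shifted by 1 unit.
test_samples = [1.0, 2.0, 3.0]
cae_samples = [2.0, 3.0, 4.0]
area = area_metric(test_samples, cae_samples)
vertical = ks_distance(test_samples, cae_samples)
```

For a pure shift, the area metric returns the shift itself in physical units, whereas the vertical distance is a dimensionless probability difference.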

In [15] the author proposed a discretized version of the
area metric that gives the flexibility to choose which portion of
the ECDF is emphasized in the comparison. In [16] the
authors used the Anderson-Darling test statistic as a measure
of the discrepancy between two CDFs. The Anderson-Darling
test uses a weighted quadratic ECDF statistic to
measure the distance between the two CDFs and heavily
penalizes deviations in the tail portion of the CDF. It was
shown that the Anderson-Darling test has more statistical
power than the Kolmogorov-Smirnov test [17].

PDF/CDF-based  techniques  do  not  require  a  distribution 
assumption. Their application is limited to scalars. The only
implementation for scalar time series is the
u-pooling technique developed in [14]. PDF/CDF-based
techniques cannot handle the correlation structure of
multivariate data. Some of the PDF/CDF-based techniques
have  objective  rejection  criteria  but  require  the  PDF/CDF  of 


Model  Validation  for  Simulations  of  Vehicle  Systems,  Pan  et  al.  UNCLASSIFIED 


Proceedings  of  the  2012  Ground  Vehicle  Systems  Engineering  and  Technology  Symposium  (GVSETS) 


the experimental system response quantity (SRQ) to be known. Model
confidence is not quantified by PDF/CDF-based techniques as only a measure
of the distance between the two PDFs/CDFs is calculated.
SME opinions can be incorporated to reveal the distribution
of either test data or CAE data. Issues related to type-I error
do not exist since PDF/CDF-based techniques are not based
on hypothesis testing. Computational cost is negligible.

Bayesian posterior estimation techniques: validation
techniques that estimate the posterior distribution of test data
and CAE data using Bayes' theorem. Bayesian posterior
estimation  techniques  can  be  considered  as  a  combination  of 
feature-based  techniques  and  PDF/CDF-based  techniques,  in 
that  a  bias  function  is  used  to  quantify  the  discrepancy  in  the 
magnitudes  of  test  data  and  CAE  data,  and  a  Gaussian 
process  is  implemented  to  handle  non-deterministic  data. 
These  techniques  can  be  traced  to  [18],  where  a  Gaussian 
Process  was  used  to  model  the  test  data  and  CAE  data 
(scalar  time  series)  and  the  posterior  parameters  in  the 
Gaussian  process  were  inferred  using  Bayes’  theorem.  The 
authors  suggested  performing  normality  transformations  if 
the  data  is  not  normal. 

Bayarri et al. (see [19]) developed tolerance bounds for
model predictions. Their perspective on validation is not
simply to provide a yes/no answer to the question of whether to
accept the computer model, but rather to evaluate the
accuracy of the computer model prediction (CAE data) for the
intended use.

Higdon (see [20]) developed posteriors based on non-normal
priors of parameters of the Gaussian process model.
Chen  et  al.  [21,  22]  developed  posteriors  for  both  model  bias 
and  output  using  a  more  flexible  beta  distribution  prior. 
Tolerance  bounds  were  developed  for  validation  purposes. 
The  traditional  criterion  for  validation  is  that  the  model  is 
accepted  if  the  interval  of  the  model  bias  contains  zero  or  if 
the  interval  of  the  true  value  of  the  system  response  quantity 
contains  the  computer  model  output.  This  criterion  can  be 
problematic  since  it  tends  to  reject  the  computer  model  at 
regions  with  many  physical  observations  (and  thus 
prediction  intervals  are  narrow)  but  fails  to  reject  the 
computer  model  at  regions  with  few  or  no  physical 
observations  (and  thus  prediction  intervals  are  wide). 

Bayesian  posterior  estimation  techniques  are  dependent  on 
a  normality  assumption  since  a  Gaussian  process  model  is 
used.  Sample  size  has  a  significant  effect  on  the  width  of 
tolerance  bounds.  The  technique  is  limited  to  scalar  time 
series.  Bayesian  posterior  estimation  techniques  do  not  have 
objective  rejection  criteria.  Model  confidence  can  be 
quantified.  SME  opinions  are  incorporated  in  terms  of  prior 
distributions  of  the  parameters  of  the  Gaussian  process 
model.  Bayesian  posterior  estimation  techniques  are  not 
subject  to  issues  related  to  type-I  error  since  they  are  not 
based  on  hypothesis  testing.  Computational  cost  is  high  due 
to  the  use  of  the  Gaussian  process,  MCMC  and  MLE. 


Classical hypothesis testing techniques: validation
techniques that evaluate a defined statistical hypothesis. For
non-deterministic scalar data, the t-test is used to assess the
similarity between the means of test data and CAE data [6,
23, 24], and the F-test to assess the similarity between the
variances [6, 23, 24]. Extension to vector data can be
achieved by using Hotelling's $T^2$-test for comparing
multivariate means [25, 26], and Wilks' $\Lambda$-distribution for
comparing covariance matrices [26, 27]. Multivariate
hypothesis tests (hypothesis tests designed for vectors)
limit the inflation of type-I error present in multiple
univariate tests (hypothesis tests designed for scalars)
[28]. Normality is assumed for both the test data and CAE
data in all these hypothesis tests [23]. When this assumption
is not valid, transformation to normality is suggested [24].
Alternatively, the bootstrap method was suggested to
estimate the distribution of data [26]. In [24] the authors
suggested using univariate and multivariate tests
collectively. The univariate tests can yield conflicting
validation  results  but  can  identify  which  response  in  the 
multivariate  data  is  most  suspect.  Multivariate  tests,  on  the 
other  hand,  take  into  account  the  correlation  structure. 

A method closely related to Hotelling's $T^2$-test is the $r^2$
method developed in [29] (referred to as the Mahalanobis
distance later). The $r^2$ method assumes normality, and the $r^2$
statistic follows a $\chi^2$ distribution. The critical value is
determined as the cumulative probability of a $\chi^2$ random
variable greater than the given significance level. The
computer model is rejected if the probability of $r^2$ being
greater than the critical value is less than the significance
level. The $r^2$ method is applicable to both scalar and vector
data and takes into account uncertainty in the model
parameters. This method was further developed by
formulating confidence intervals for the $r^2$ statistic [30]. It
was extended to non-normal data by the use of maximum
likelihood estimation (MLE) [31]. The rejection criteria can
be  determined  by  Monte  Carlo  simulation. 
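
The idea of a Mahalanobis-type statistic compared against a $\chi^2$ distribution can be sketched for the special case of two responses, where the $\chi^2$ tail probability with two degrees of freedom has the closed form exp(-r^2/2). The Python below is a simplified illustration under stated assumptions (known covariance, two responses, hypothetical names and numbers) and is not the full procedure of [29].

```python
import math


def mahalanobis_sq_2d(diff, cov):
    """Squared Mahalanobis distance d^T C^{-1} d for a 2-vector d."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    x, y = diff
    # Uses the closed-form inverse of a 2x2 matrix.
    return (d * x * x - (b + c) * x * y + a * y * y) / det


def chi2_tail_2dof(r_sq):
    """P(X > r_sq) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-r_sq / 2.0)


# Hypothetical mean difference between test and CAE data and its covariance.
difference = (1.0, 1.0)
covariance = ((1.0, 0.0), (0.0, 1.0))
significance = 0.05

r_squared = mahalanobis_sq_2d(difference, covariance)
tail_probability = chi2_tail_2dof(r_squared)
reject_model = tail_probability < significance
```

With these numbers the tail probability is well above the significance level, so the model would not be rejected. Changing the significance level can flip the decision, which is the sensitivity to type-I error level noted in Section 1.1.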

Classical hypothesis testing techniques depend on a
normality assumption except for the modified $r^2$ method in
[31]. Classical hypothesis testing techniques are of the
point-null hypothesis testing type and validation results are
affected by sample size [28]. Application to time series is
not appropriate because of the serial dependence. Classical
hypothesis testing techniques have objective rejection
criteria. Model confidence is not quantified because classical
hypothesis testing techniques only judge whether a computer
model is accurate. SME opinions are not currently
incorporated but can be useful for determining the
distribution used in the modified $r^2$ method [31]. Classical
univariate hypothesis testing is subject to accumulation of
type-I error when applied to each response of multivariate
data. The choice of significance level has a substantial effect
on validation results. Computational cost is low except for
the $r^2$ method. Classical hypothesis testing techniques are best
used for validating computer models that generate
non-deterministic scalar or vector outputs, assuming normality.

Bayesian hypothesis testing techniques: validation
techniques that combine classical hypothesis testing
techniques and Bayes' theorem to update validation results
based on available data and SME opinions. Using the Bayes
factor [23, 32, 33], a hypothesis test can be set up to
examine whether the Bayes factor is above or below unity
[34]. Normality is no longer required but can be used to
provide  an  explicit  expression  of  the  posterior  distribution. 
In  [35]  the  authors  treated  the  Bayes  factor  as  a  random 
variable  to  address  the  uncertainty  in  model  parameters.  In 
[24]  the  authors  transformed  non-normal  data  to  normal  and 
showed  how  the  transformation  helps  reduce  the  type-I  error. 
In  [36,  37]  multiple  data  sets  were  considered  by  assuming 
the  data  in  each  set  are  independent.  The  overall  Bayes 
factor  is  calculated  by  multiplying  together  the  individual 
Bayes factors for each data set. In [38] the authors derived
model confidence based on the Bayes factor and claimed to be
the first to derive an explicit expression of the model
confidence for Bayesian point-null hypothesis testing.
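
The relation between a Bayes factor and a posterior probability of the null hypothesis follows directly from Bayes' theorem for two competing hypotheses. The sketch below combines independent data sets by multiplying their Bayes factors and then converts the result to a confidence value. The function names, the default prior of 0.5, and the sample factors are assumptions for illustration; the explicit expression derived in [38] is not reproduced here.

```python
import math


def overall_bayes_factor(factors):
    """Combine Bayes factors from data sets assumed to be independent."""
    return math.prod(factors)


def model_confidence(bayes_factor, prior_h0=0.5):
    """Posterior probability of H0 given the Bayes factor B = P(D|H0)/P(D|H1)."""
    if not 0.0 < prior_h0 < 1.0:
        raise ValueError("prior probability must lie strictly between 0 and 1")
    weighted = bayes_factor * prior_h0
    return weighted / (weighted + (1.0 - prior_h0))


# Hypothetical Bayes factors from three independent data sets.
individual_factors = [2.0, 1.5, 0.5]
combined = overall_bayes_factor(individual_factors)
confidence = model_confidence(combined)
```

With equal prior probabilities, a Bayes factor of unity corresponds to a confidence of 0.5, which is consistent with using unity as the decision threshold.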

A  comparison  between  point-null  and  interval  based 
hypothesis  testing  was  made  in  [16,  39];  it  was  shown  that 
the  chance  of  rejecting  a  correct  model  increases  as  the 
sample  size  increases  for  point-null  hypothesis  testing. 

To obtain more consistent results, a Bayesian interval-based
hypothesis testing method (BIH) was proposed [38].
Bayesian hypothesis testing techniques were demonstrated
to be superior to classical hypothesis testing because both
hypotheses (null and alternative) are considered
simultaneously [35]. Similarly, it was shown that the p-value
used in classical hypothesis testing can engender misleading
results [40].

Bayesian hypothesis testing techniques are not dependent
on a normality assumption, although the selection of a
non-normal distribution may increase the computational cost.
Sample  size  does  not  have  a  significant  effect  on  Bayesian 
interval-based  hypothesis  testing.  Bayesian  hypothesis 
testing  techniques  have  objective  rejection  criteria  based  on 
model  confidence.  SME  opinions  are  incorporated  to 
determine  parameters  used  in  the  prior  distribution  of  the  test 
statistic.  Bayesian  hypothesis  testing  techniques  are  not 
subject  to  issues  related  to  type-I  error.  Computational  cost 
is  modest,  although  not  as  low  as  the  previously  described 
methods. 


Applicable to scalar data
Applicable to vector data
Applicable to scalar time series
Applicable to vector time series
Consider multivariate correlation
Include objective criteria
Quantify model confidence
Can incorporate SME opinions
Can work without normality assumption
Insensitive to type-I error
Low computational cost
Sample size independence


Figure  2:  Attributes  of  validation  techniques 


2.  METHODOLOGY 

Dimensionality reduction techniques are commonly used
for multivariate data. In the context of validation, Principal
Component Analysis (PCA) was coupled with the $r^2$ method
[40], and with Hotelling's $T^2$-test [12]. However, PCA lacks
the ability to deal with non-deterministic data. BIH was
coupled with Probabilistic Principal Component Analysis
(PPCA) to remove correlation of data, reduce dimensionality
and handle non-deterministic data [41-43]. This is the basis
of  the  Bayesian  validation  framework  whose  process  is 
shown  in  Figure  3. 


Figure  3:  Bayesian  interval-based  hypothesis 
testing  coupled  with  PPCA 

First, multivariate test data and CAE data are obtained from
experiments and simulations. PPCA is applied to the
difference  between  test  and  CAE  data  to  obtain  the  reduced 
difference.  The  PPCA  transformation  matrix  is  a  function  of 
the  eigenvalues  and  eigenvectors  of  the  covariance  matrix  of 
the  difference  data.  A  latent  variable  model  is  established  to 
relate  the  difference  data  (observed)  to  a  corresponding 
vector  of  latent  (unobserved)  variables.  The  reduced 
difference  is  the  expectation  of  the  latent  variable.  The 
dimensionality  reduction  is  achieved  by  retaining  only  a  few 
of  the  largest  eigenvalues  so  that  the  resulting  reduced 
difference  data  represent  at  least  95%  of  the  variability 
information  in  the  difference  data. 
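
The retention rule described above (keep the largest eigenvalues until at least 95% of the variability is represented) can be sketched as follows. This covers only the selection step, not the PPCA latent variable model itself, and the function name and eigenvalues are hypothetical.

```python
def components_to_retain(eigenvalues, threshold=0.95):
    """Smallest number of leading eigenvalues whose share of the total
    variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues of the covariance matrix of the difference data.
example_eigenvalues = [6.0, 3.0, 0.6, 0.3, 0.1]
retained = components_to_retain(example_eigenvalues)
```

Here the first two eigenvalues explain 90% of the variance and the first three explain 96%, so three components would be retained and the five-dimensional difference data reduced to three dimensions.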

After  PPCA,  the  reduced  difference  data  is  uncorrelated. 
As  a  result,  various  validation  techniques  can  be  considered 
that  are  only  suitable  for  univariate  data  (scalar  or  scalar 
time  series).  The  Bayesian  hypothesis  testing  technique  is 
selected  here  as  it  is  the  only  technique  that  produces  model 
confidence  which  provides  a  quantitative  assessment  of  the 
goodness  of  the  model. 

Bayesian  interval-based  hypothesis  testing  is  performed  on 
the  reduced  difference  data.  The  test  examines  whether  the 
expected  value  of  the  reduced  difference  is  within  the 
integration  bounds  of  the  integral  of  Eq.  (2.1).  The  null 
hypothesis  is  that  the  expected  reduced  difference  is  within 
the  integration  bounds  (accept  the  model).  The  prior 
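distribution of the expected reduced difference is assumed to
be Gaussian. Its posterior is obtained by applying Bayes'
rule to update the prior using the observed data (reduced
difference) and a Gaussian model with mean vector $\rho_0$ and
covariance $\Lambda_0$. The model confidence $\kappa$ is calculated as:

$$\kappa = \int_{-\varepsilon}^{\varepsilon} \frac{1}{\sqrt{|2\pi\Lambda_0|}} \exp\left(-\frac{1}{2}(v - \rho_0)^T \Lambda_0^{-1} (v - \rho_0)\right) dv \qquad (2.1)$$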

The  model  confidence  is  the  probability  that  the  expected 
reduced  difference  falls  in  the  integration  bounds  with 
respect  to  its  posterior  probability  density  function. 
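
Because the reduced difference components are uncorrelated after PPCA, a Gaussian posterior with a diagonal covariance factorizes, and the probability of falling inside the integration bounds becomes a product of univariate interval probabilities. The sketch below relies on that diagonal-covariance assumption; the function name and the numerical values are hypothetical.

```python
from statistics import NormalDist


def interval_confidence(posterior_means, posterior_sds, bounds):
    """Probability that every component lies within [-bound, +bound],
    assuming independent Gaussian posteriors for the components."""
    confidence = 1.0
    for mean, sd, bound in zip(posterior_means, posterior_sds, bounds):
        dist = NormalDist(mu=mean, sigma=sd)
        confidence *= dist.cdf(bound) - dist.cdf(-bound)
    return confidence


# Hypothetical posterior for two reduced-difference components.
single = interval_confidence([0.0], [1.0], [1.96])
joint = interval_confidence([0.0, 0.0], [1.0, 1.0], [1.96, 1.96])
```

Widening the bounds raises the confidence and narrowing them lowers it, which is why the choice of integration bounds matters so much in practice.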

2.1  Integration  Bounds 

Model  confidence  was  shown  to  be  sensitive  to  the 
selection  of  the  integration  bounds  [41].  Here  two  methods 
of  selecting  the  integration  bounds  will  be  explored. 

Norm-based integration bounds: As illustrated in Figure 4,
error bounds $[-e, e]$ are set up symmetrically around the test
data, where $e$ is the maximum allowable deviation from the
data:

$$e = b\,\|t\|_\infty \qquad (2.1)$$

where $\|\cdot\|_\infty$ denotes the infinity norm or maximum norm of
the test data and $e \in \mathbb{R}^{m \times 1}$; $t$ is the test data, $t \in \mathbb{R}^{m \times n}$, and
$m$ is the number of responses and $n$ the number of
observations of each response.

The magnitude of $e$ is chosen to be some fraction, $b$, of
the $L_\infty$ norm of the test data based on intended engineering
applications or SME opinion.
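
A minimal sketch of the norm-based error bounds follows, assuming the infinity norm is taken separately for each response (each row of the test data) so that one bound is obtained per response. The function name, the fraction, and the data are hypothetical.

```python
def norm_based_error_bounds(test_data, fraction):
    """One error bound per response: fraction times the largest absolute
    value observed for that response."""
    if fraction <= 0.0:
        raise ValueError("fraction must be positive")
    return [fraction * max(abs(v) for v in response) for response in test_data]


# Hypothetical test data: 2 responses (rows) with 3 observations each.
example_test_data = [
    [20.0, 35.0, 50.0],
    [-4.0, 2.0, 1.0],
]
error_bounds = norm_based_error_bounds(example_test_data, fraction=0.1)
```

Each bound carries the physical units of its response, so a 10% fraction of a 50-degree peak allows a 5-degree deviation.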


Figure 4: Example of norm-based integration bounds
(plot title: Temperature History for Configuration 1)


The magnitude of the integration bounds used in the
calculation of model confidence is calculated using:

$$\varepsilon = \mathrm{abs}(M^{-1} W^T e) \qquad (2.2)$$

where abs(·) returns the absolute value. The matrix product
$M^{-1} W^T$ is the same transformation matrix applied to the
difference data to obtain the reduced difference in the PPCA
transformation.

Variability-based integration bounds: Following the
procedure outlined in [41], the integration bounds magnitude
is calculated as a fraction of the standard deviations of the
reduced test data:

$$\varepsilon = b \sqrt{\mathrm{diag}(\Sigma_t)} \qquad (2.3)$$

where diag(·) returns the diagonal components of a matrix
as a vector, and $b$ is determined iteratively by considering
only the covariance of the reduced test data in Eq. 2.1. $\Sigma_t$ is
the uncertainty associated with the test data.
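
The variability-based bounds can be sketched by estimating the diagonal of the covariance from sample variances of each reduced component. Treating the multiplier as a given input is an assumption of this sketch, since the iterative procedure of [41] for determining it is not described here; the names and data are hypothetical.

```python
import math
import statistics


def variability_based_bounds(reduced_test_data, multiplier):
    """Multiplier times the sample standard deviation of each reduced
    component (the square root of the covariance diagonal)."""
    return [
        multiplier * math.sqrt(statistics.variance(component))
        for component in reduced_test_data
    ]


# Hypothetical reduced test data: 1 component with 5 repeated observations.
example_reduced = [[1.0, 2.0, 3.0, 4.0, 5.0]]
variability_bounds = variability_based_bounds(example_reduced, multiplier=0.5)
```

Unlike the norm-based bounds, these bounds scale with the scatter of the test data rather than with its magnitude.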

2.2  Bootstrapping 

In  the  methodology  described  in  Section  2,  it  is  assumed 
that  the  reduced  difference  follows  a  multivariate  normal 
distribution. Often, this assumption may not
hold. To remedy this, a bootstrap-based
technique was developed as an alternative approach to
calculate model confidence without relying on an assumed
distribution for the error model.

The bootstrap method was introduced by Bradley Efron [44] in 1979; the primary objective was to calculate confidence intervals for parameters in situations where standard methods were not applicable [45]. For example, asymptotic results can be unacceptably inaccurate when the number of observations is small. Since its invention, the bootstrap method has been applied in many engineering fields, including geophysics, biomedical engineering, image processing, environmental engineering, and artificial neural networks.

The  bootstrap-based  technique  developed  for  the  research 
presented  in  this  paper  is  illustrated  in  Figure  5. 


Figure  5:  Bootstrapping  technique 

In most practical applications, the number of resamples that need to be drawn should be of the order of a thousand [46]. More detailed guidelines on choosing this number are provided in [47]. The bootstrap method employed here is of the non-parametric type; however, parametric bootstrapping will also be considered in future research, since it is noted in [46] that parametric bootstrap methods can be more accurate than non-parametric ones when the sample size is small. In addition, the i.i.d. assumption for the samples is arguable; therefore, we will also consider bootstrap methods designed for dependent data, e.g., moving-block bootstrapping [46].
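The sketch below illustrates a generic non-parametric bootstrap for a single reduced difference. It is one plausible reading of the idea, not a reproduction of the procedure in Figure 5: observations are resampled with replacement, the mean of each resample is computed, and the fraction of resample means that fall inside [-eps, eps] is reported as an estimate of confidence. The function name, the data, and the bound value are hypothetical; the default of 1000 resamples follows the order-of-a-thousand guidance above.

```python
import random
import statistics


def bootstrap_confidence(differences, eps, n_resamples=1000, seed=0):
    """Fraction of bootstrap resample means lying within [-eps, eps]."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(differences)
    inside = 0
    for _ in range(n_resamples):
        # Draw n observations with replacement (treats them as i.i.d.).
        resample = rng.choices(differences, k=n)
        if abs(statistics.fmean(resample)) <= eps:
            inside += 1
    return inside / n_resamples


# Hypothetical differences between test data and model predictions.
differences = [0.12, -0.30, 0.05, 0.22, -0.08, 0.15, -0.11, 0.02]
confidence = bootstrap_confidence(differences, eps=0.15)
```

No distribution is assumed for the differences, but the resampling step does rely on the i.i.d. assumption; a moving-block variant would resample contiguous blocks instead of single observations.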


3.  THERMAL  BENCHMARK  PROBLEM 

In this section, we illustrate the presented Bayesian methodology for quantifying model confidence using a benchmark validation problem from the literature. Specifically, a thermal benchmark problem [48] was developed for a model validation challenge workshop held at Sandia National Laboratories in 2006. The computational model to be validated is a one-dimensional heat conduction model that predicts temperature for a material layer of thickness L subject to a specified heat flux (Figure 6).

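Figure 6: Schematic of the heat conduction problem [48]. Labels recoverable from the schematic: thermal properties k and ρCp, initial condition T(t = 0) = T_i, heated face at x = 0, and adiabatic face (q = 0) at x = L.

The boundary conditions are specified flux on the x = 0 face and adiabatic on the x = L face. The computational model for temperature prediction is given by:

[Equation (3.1) is not recoverable from the source text; see the problem statement in [48].]

The thermal properties k and ρCp, and the initial condition for temperature, T_i, are prescribed constants.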
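For orientation, the classical series solution for a slab that is heated by a constant flux on one face and insulated on the other can be sketched as follows. This is the textbook solution for those boundary conditions and is offered as an illustration only; it is not claimed to match the exact form or truncation of Eq. 3.1. The function name, the number of series terms, and the property values (k, rho_cp, T_i) are hypothetical placeholders; q and L correspond to 1000 W/m^2 and 1.27 cm.

```python
import math


def slab_temperature(x, t, q, L, k, rho_cp, T_i, n_terms=50):
    """Temperature at depth x and time t in a slab of thickness L.

    Constant flux q enters at x = 0; the face at x = L is adiabatic.
    n_terms is an assumed truncation of the infinite series.
    """
    alpha = k / rho_cp             # thermal diffusivity
    fo = alpha * t / L ** 2        # dimensionless time (Fourier number)
    xi = x / L                     # dimensionless depth
    series = sum(
        math.exp(-((n * math.pi) ** 2) * fo) * math.cos(n * math.pi * xi) / n ** 2
        for n in range(1, n_terms + 1)
    )
    bracket = fo + 1.0 / 3.0 - xi + 0.5 * xi ** 2 - (2.0 / math.pi ** 2) * series
    return T_i + (q * L / k) * bracket


# Hypothetical property values, for illustration only.
q, L = 1000.0, 0.0127                 # W/m^2, m
k, rho_cp, T_i = 0.06, 4.0e5, 25.0    # W/(m K), J/(m^3 K), deg C
front = slab_temperature(0.0, 100.0, q, L, k, rho_cp, T_i)
back = slab_temperature(L, 100.0, q, L, k, rho_cp, T_i)
```

A quick sanity check on any implementation of this kind is the energy balance: the spatially averaged temperature rise should equal q t / (rho_cp L), because all of the heat supplied at x = 0 stays in the slab.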

Four replicate experiments were conducted for each of four configurations (combinations) of thickness, L, and heat flux magnitude, q, on the x = 0 face (two levels for each variable) to obtain test data. The values of q and L in each configuration are given in Table 1.


Table 1: Values of q and L in each configuration [48]

Configuration    Heat flux, q (W/m^2)    Thickness, L (cm)
1                1000                    1.27
2                1000                    2.54
3                2000                    1.27
4                2000                    2.54

All the experimental data are provided in [48]. It is assumed that there is no measurement error. Graphical comparison of test data to CAE data is shown in Figures 7-10. The error bars indicate the maximum and minimum values of the four replicate experiments.
Figure 7: Graphical comparison of test and CAE data for configuration 1


Figure  8:  Graphical  comparison  of  test  and 
CAE  data  for  configuration  2 


Figure  9:  Graphical  comparison  of  test  and 
CAE  data  for  configuration  3 


Figure  10:  Graphical  comparison  of  test  and 
CAE  data  for  configuration  4 

3.1  Validation  Results 

There  are  7  published  solutions  to  the  benchmark  problem 
[49-55].  Each  of  these  approaches  fall  under  one  of  the 
categories  presented  in  Section  1.2  (see  Figure  11).  All  of 
these  approaches  yield  qualitative  assessments,  as 
summarized  in  the  authors’  own  words  in  Table  2). 

We calculated model confidence for four variations of the presented Bayesian validation framework; results are presented in Figure 12. ‘Norm-based Bayesian’ refers to the method that employs the norm-based integration bounds introduced in Section 2.1 and calculates the model confidence using the Bayes factor. ‘Norm-based bootstrap’ refers to the method that employs the same norm-based integration bounds but calculates the model confidence using the bootstrap technique presented in Section 2.2. ‘Variability-based Bayesian’ and ‘Variability-based bootstrap’ differ from ‘Norm-based Bayesian’ and ‘Norm-based bootstrap’, respectively, only in that the variability-based integration bounds were used instead of the norm-based integration bounds.

While there is a small variation in the results, it can be concluded that for this benchmark problem i) the normality assumption made in the Bayesian calculation does not have a significant impact on model confidence quantification; ii) validation results are relatively insensitive to the technique for determining integration bounds; and iii) the computational model can be accepted as an adequate representation of reality, since the model confidence is well above 50%.
Figure 11: Categorization of solutions to the thermal benchmark problem


Table 2: Summary of validation results from the published solutions to the thermal benchmark problem

Liu et al. [49]                    “Negligible bias”
Ferson et al. [50]                 “Mismatch”
Higdon et al. [51]                 “Small discrepancy”
Hills and Dowding [52]             “Poor”
McFarland and Mahadevan [53]       “Valid”
Brandyberry [54]                   “Equivalent means”
Rutherford [55]                    “Inadequate”

Figure 12: Comparison of validation results (x-axis, Validation Approaches: Norm-based Bayesian, Norm-based bootstrap, Variability-based Bayesian, Variability-based bootstrap)


3.2  Statistical  Power 

Statistical  power  is  a  useful  tool  to  assess  the  robustness  of 
the  computed  model  confidence.  Formally,  it  is  defined  as 
the  probability  that  the  hypothesis  test  procedure  will  reject 
the  null  hypothesis  when  it  is  false  (i.e.,  the  probability  of 
not  committing  a  type-II  error)  [56]. 

For the thermal benchmark problem, the statistical power of the Bayesian validation framework is significantly higher than that of classical hypothesis testing (79% vs. 11%). Bayesian hypothesis testing supports the null hypothesis directly by providing the probability of it being true, while classical hypothesis testing does so only indirectly, by concluding that there is not sufficient evidence to reject the null hypothesis.

The factors that influence statistical power are the type of hypothesis testing used, the sample size, and the distance between the test statistic (the expected reduced difference) and the integration bounds. Statistical power is low if the integration bounds are set to be too narrow or the sample size is not large enough. Guidelines can be established to choose the ideal sample size or the integration bounds to achieve a certain level of statistical power.
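The dependence of power on sample size can be illustrated with a small Monte Carlo experiment. The sketch below uses a classical two-sided z-test of a zero-mean null hypothesis with known standard deviation; it is a generic illustration of the definition of power and does not reproduce the power calculation behind the 79% and 11% figures. The function name, the effect size, and the sample sizes are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_test_power(true_mean, sigma, n, alpha=0.05, n_trials=2000, seed=0):
    """Estimate the probability of rejecting H0: mean = 0 when it is false."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_trials


# Same true offset from the null, two different sample sizes.
power_small = z_test_power(true_mean=0.5, sigma=1.0, n=4)
power_large = z_test_power(true_mean=0.5, sigma=1.0, n=40)
```

With the offset held fixed, the larger sample gives a markedly higher rejection rate, which is the sample-size effect described above.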

4.  ONGOING  AND  FUTURE  WORK 

The research presented in this paper supports the activities of a tri-service Energy/Power Community of Interest (E/P CoI) that aims to provide best-practice guidelines for model sharing, verification, and validation. Current members of this CoI include the Air Force Research Laboratory; the U.S. Army Tank Automotive Research, Development and Engineering Center; the Navy Surface Warfare Center, Carderock Division; the Automotive Research Center at the University of Michigan; and the Electric Ship Research and Development Consortium at Florida State University. A straw-man model [57] has been developed by the Air Force Research Laboratory to be used as a testbed for the CoI's activities, and an electro-thermal battery model, developed at the Automotive Research Center at the University of Michigan [58], has been integrated in the straw-man model (Figure 13).

As a first step towards validating the straw-man model, the Bayesian validation framework was used to quantify model confidence for the electro-thermal battery model. The model confidence is 99%, indicating a good match between test data and CAE data, which is consistent with the graphical comparison shown in Figures 14 and 15. Fragments of data are presented for clarity.
Figure 13: Straw-man model [57]


Figure  14:  Graphical  comparison  of  test  and 
CAE  data  for  battery  surface 
temperature 




Figure  15:  Graphical  comparison  of  test  and 
CAE  data  for  battery  terminal 
voltage 


5.  SUMMARY 

There  exist  various  validation  techniques  developed  for 
different  purposes  and  applications.  However,  there  are  no 
clear  formal  guidelines  for  using  these  techniques. 
Categorization  of  existing  validation  methods  is  thus 
essential  to  compare  them  systematically  in  order  to 
establish  suitable  application  domains  for  each  category.  We 
have  presented  such  a  categorization  in  this  paper  based  on 
several  attributes  that  highlight  their  advantages  and 
disadvantages.  The  Bayesian  validation  framework  was 
found  to  be  the  only  validation  technique  that  yields 
quantitative  (as  opposed  to  qualitative)  assessment  of  the 
goodness  of  a  model.  Based  on  this  finding,  we  implemented 
a  Bayesian  validation  method  and:  i)  developed  alternative 
techniques  for  determining  the  integration  bounds  used  for 
computing  model  confidence  and  ii)  incorporated  a 
bootstrap-based  technique  to  eliminate  the  need  to  assume 
any  distribution  model  for  the  data.  We  also  used  statistical 
power  to  assess  the  robustness  of  model  confidence, 
showing  that  Bayesian  hypothesis  testing  is  superior  to 
classical  hypothesis  testing. 